
[Bugfix][Multimodal] PyAV video backend returns keyframes labeled as targets #42586

Merged
Isotr0py merged 4 commits into vllm-project:main from WindChimeRan:fix/pyav-backend-keyframe-snap
May 14, 2026

Conversation

@WindChimeRan
Contributor

@WindChimeRan WindChimeRan commented May 14, 2026

Bug

From #39986

With backend="pyav", the loader returns the keyframe at or before each requested target but labels it as the target. container.seek() defaults to backward=True, which snaps to the nearest keyframe with PTS ≤ the requested pts; the previous loop then took the very next decoded frame without advancing to the actual target. Affected workloads: any request using media_io_kwargs={"video": {"backend": "pyav"}} on long-GOP clips.
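The failure mode can be illustrated with a toy model that needs neither PyAV nor vLLM. This is a sketch, not the loader's code: `KEYFRAMES`, `seek_backward`, `buggy_fetch`, and the target indices are hypothetical names and values chosen for illustration, and frames are reduced to integer PTS values.

```python
from bisect import bisect_right

# Toy model (assumption): frames are identified by integer PTS, and the clip
# has a single keyframe at PTS 0, like a 200-frame clip encoded with GOP=200.
KEYFRAMES = [0]


def seek_backward(target_pts: int) -> int:
    """Return the PTS of the nearest keyframe at or before target_pts."""
    pos = bisect_right(KEYFRAMES, target_pts) - 1
    return KEYFRAMES[max(pos, 0)]


def buggy_fetch(targets: list[int]) -> list[tuple[int, int]]:
    """Seek, then keep the first decoded frame without advancing to the target."""
    out = []
    for target in targets:
        first_decoded = seek_backward(target)  # decoding restarts at the keyframe
        out.append((target, first_decoded))    # (label, frame actually returned)
    return out


# Illustrative sample indices; the real loader computes its own.
targets = [0, 28, 56, 85, 113, 142, 170, 199]
result = buggy_fetch(targets)
for label, actual in result:
    print(f"labelled {label:>3}, actually frame {actual}")
```

Every slot carries a distinct label, yet in this model each one holds the frame at the lone keyframe, which matches the symptom described above.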

Fix

Decode forward until frame.pts >= pts. Reuse the open decoder while targets advance monotonically (the np.linspace common case). Only re-seek on rewind or stream exhaustion, so the GOP prefix isn't re-decoded once per target.
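The control flow of the fix can be sketched on a toy decoder. This is an illustration under stated assumptions, not the code in vllm/multimodal/video.py: `ToyDecoder`, `fetch_frames`, and the keyframe positions are hypothetical, and frames are again reduced to integer PTS values so the seek and decode counts are easy to inspect.

```python
from typing import Iterator, Optional

KEYFRAMES = [0, 100]   # hypothetical keyframe positions
TOTAL_FRAMES = 200


class ToyDecoder:
    """Stand-in for a container plus decoder; counts seeks and decoded frames."""

    def __init__(self) -> None:
        self.seeks = 0
        self.decoded = 0

    def seek(self, target_pts: int) -> Iterator[int]:
        """Snap back to the nearest keyframe <= target_pts, then decode forward."""
        self.seeks += 1
        start = max((k for k in KEYFRAMES if k <= target_pts), default=KEYFRAMES[0])
        return self._frames_from(start)

    def _frames_from(self, start: int) -> Iterator[int]:
        for pts in range(start, TOTAL_FRAMES):
            self.decoded += 1
            yield pts


def fetch_frames(decoder: ToyDecoder, targets: list[int]) -> list[int]:
    """Return, for each target, the first frame whose pts >= target."""
    out: list[int] = []
    frames: Optional[Iterator[int]] = None
    last_pts = -1
    for target in targets:
        # Re-seek only when no decoder is open yet or the target rewinds.
        if frames is None or target <= last_pts:
            frames = decoder.seek(target)
        found = next((p for p in frames if p >= target), None)
        if found is None:
            # Stream exhausted: re-seek once and decode forward again.
            frames = decoder.seek(target)
            found = next((p for p in frames if p >= target), None)
        if found is None:
            raise ValueError(f"no frame at or after pts {target}")
        out.append(found)
        last_pts = found
    return out


decoder = ToyDecoder()
picked = fetch_frames(decoder, [0, 28, 56, 85, 113, 142, 170, 199])
print(picked, decoder.seeks, decoder.decoded)
```

With monotonically increasing targets the toy decoder seeks once and decodes each frame at most once; a rewinding target such as `[150, 10]` triggers a second seek.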

Test

Synthesised single-keyframe 200-frame H.264 fixture (green channel = frame index) decoded by both backends:

  • Pre-fix pyav: 8 copies of frame 0 (collapsed onto the lone keyframe).
  • Post-fix pyav: pixel-identical to opencv across all 8 sampled slots, mean|Δ| = 0.0.

Self-contained CPU repro runs in <1s — no model, no dataset, no GPU. Happy to land it as a regression test under tests/multimodal/ if reviewers want; the existing pyav tests only check shape/dtype/count, never pixel content vs the opencv reference, which is why this slipped through.
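The gap between a shape-only check and a content check can be shown on synthetic arrays. This is a sketch that assumes lossless frames and hypothetical helper names (`solid_frame`, `recovered_indices`, `shape_only_check`); it does not call either backend, and the requested indices are illustrative.

```python
import numpy as np

HEIGHT, WIDTH = 64, 64


def solid_frame(index: int) -> np.ndarray:
    """Build an RGB frame whose green channel equals the frame index."""
    img = np.zeros((HEIGHT, WIDTH, 3), dtype=np.uint8)
    img[:, :, 1] = index
    return img


def recovered_indices(frames: np.ndarray) -> list[int]:
    """Read the frame index back from the green channel of the centre pixel."""
    return [int(f[HEIGHT // 2, WIDTH // 2, 1]) for f in frames]


def shape_only_check(frames: np.ndarray, expected_count: int) -> bool:
    """The kind of check that cannot see which frames were returned."""
    return (frames.shape == (expected_count, HEIGHT, WIDTH, 3)
            and frames.dtype == np.uint8)


requested = [0, 28, 56, 85, 113, 142, 170, 199]             # illustrative indices
reference = np.stack([solid_frame(i) for i in requested])   # correct output
collapsed = np.stack([solid_frame(0) for _ in requested])   # keyframe-snap output

print("shape check passes on collapsed output:",
      shape_only_check(collapsed, len(requested)))
print("content check passes on collapsed output:",
      recovered_indices(collapsed) == requested)
print("content check passes on reference output:",
      recovered_indices(reference) == requested)
```

A regression test along these lines would compare recovered indices, or pixel values against the opencv output, instead of only array shape and dtype.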

Minimal Repro
"""Minimal reproduction of the PyAV-backend keyframe-snap bug.

Synthesises a 200-frame H.264 clip with one keyframe (gop_size=200, no
B-frames) and a known signal: each frame N is a solid color whose green
channel equals N. Decodes 8 uniformly-spaced frames via the actual vLLM
loader (`VideoBackend.load_bytes`) for both backends, then recovers
which frame each backend returned by reading the green channel —
independent of the metadata label the loader claims.

Pre-fix: pyav returns 8 copies of frame 0, all labelled as distinct
target indices. Post-fix: pyav matches opencv pixel-for-pixel.
"""

import io
import sys
import types

# `vllm.platforms.cuda` (loaded by vllm/__init__.py) eagerly imports
# vllm._C_stable_libtorch for op-registration side effects. Some local
# editable builds don't include that .so file, but no code reads
# attributes from it. Stubbing lets us import VideoBackend without a
# full rebuild. Drop this hack if your build already provides the
# extension.
sys.modules.setdefault(
    "vllm._C_stable_libtorch", types.ModuleType("vllm._C_stable_libtorch")
)

import av  # noqa: E402
import numpy as np  # noqa: E402

from vllm.multimodal.video import VideoBackend  # noqa: E402

NUM_FRAMES = 200
FPS = 30
WIDTH, HEIGHT = 64, 64
NUM_SAMPLED = 8


def synthesise_long_gop_video() -> bytes:
    """Encode 200 frames of solid-color (green = frame index) with GOP=200."""
    buf = io.BytesIO()
    container = av.open(buf, mode="w", format="mp4")
    stream = container.add_stream("h264", rate=FPS)
    stream.width = WIDTH
    stream.height = HEIGHT
    stream.pix_fmt = "yuv420p"
    stream.codec_context.gop_size = NUM_FRAMES   # one keyframe, at frame 0
    stream.codec_context.max_b_frames = 0        # PTS == decode order
    stream.codec_context.options = {
        "x264-params": "scenecut=0:keyint=200:min-keyint=200"
    }

    for i in range(NUM_FRAMES):
        img = np.zeros((HEIGHT, WIDTH, 3), dtype=np.uint8)
        img[:, :, 1] = i                         # green channel encodes index
        frame = av.VideoFrame.from_ndarray(img, format="rgb24")
        for packet in stream.encode(frame):
            container.mux(packet)
    for packet in stream.encode():
        container.mux(packet)
    container.close()
    return buf.getvalue()


def count_keyframes(data: bytes) -> int:
    with av.open(io.BytesIO(data)) as c:
        s = c.streams.video[0]
        return sum(1 for p in c.demux(s) if p.is_keyframe)


def main() -> None:
    data = synthesise_long_gop_video()
    n_keyframes = count_keyframes(data)
    print(f"Encoded {NUM_FRAMES} frames, {len(data)} bytes, "
          f"{n_keyframes} keyframe(s).")
    assert n_keyframes == 1, (
        f"Test fixture is invalid: expected 1 keyframe, got {n_keyframes}. "
        "The bug is only visible when sampled frames span a long GOP."
    )

    opencv_frames, opencv_meta = VideoBackend.load_bytes(
        data, num_frames=NUM_SAMPLED, backend="opencv"
    )
    pyav_frames, pyav_meta = VideoBackend.load_bytes(
        data, num_frames=NUM_SAMPLED, backend="pyav"
    )

    requested = list(opencv_meta["frames_indices"])
    assert requested == list(pyav_meta["frames_indices"]), (
        f"Backends disagree on metadata indices "
        f"({opencv_meta['frames_indices']} vs {pyav_meta['frames_indices']}) "
        "— separate problem."
    )
    print(f"Both backends report frames_indices: {requested}")

    # Recover the frame each backend ACTUALLY returned by reading the green
    # channel of the centre pixel. Independent of the metadata label.
    opencv_actual = [int(f[HEIGHT // 2, WIDTH // 2, 1]) for f in opencv_frames]
    pyav_actual = [int(f[HEIGHT // 2, WIDTH // 2, 1]) for f in pyav_frames]

    print()
    print(f"{'slot':>4} {'requested':>10} {'opencv got':>11} {'pyav got':>9} "
          f"{'mean|Δ|':>9}")
    print("-" * 50)
    for i, idx in enumerate(requested):
        diff = float(np.mean(np.abs(
            opencv_frames[i].astype(int) - pyav_frames[i].astype(int))))
        print(f"{i:>4} {idx:>10} {opencv_actual[i]:>11} "
              f"{pyav_actual[i]:>9} {diff:>9.2f}")

    print()
    opencv_unique = len(set(opencv_actual))
    pyav_unique = len(set(pyav_actual))
    print(f"opencv returned {opencv_unique} distinct frames")
    print(f"pyav   returned {pyav_unique} distinct frames")
    if opencv_unique == pyav_unique == NUM_SAMPLED:
        print("\nPASS: pyav and opencv agree, all requested frames returned.")
    elif pyav_unique < opencv_unique:
        print(f"\nFAIL: pyav collapsed {NUM_SAMPLED} requested slots onto "
              f"{pyav_unique} distinct frame(s) — keyframe-snap bug present.")
    else:
        print(f"\nFAIL: backends disagree (opencv={opencv_unique}, "
              f"pyav={pyav_unique} distinct frames).")


if __name__ == "__main__":
    main()


Signed-off-by: Ranran <hzz5361@psu.edu>

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify Bot added the multi-modality and bug labels May 14, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request optimizes video frame decoding in vllm/multimodal/video.py by reusing the decoder iterator when target frames advance monotonically, which avoids redundant decoding of GOP prefixes associated with per-frame seeking. The docstring for decode_frames was also updated to reflect this shift from keyframe decoding to forward decoding to PTS. There are no review comments to address, and I have no feedback to provide.

Member

@Isotr0py Isotr0py left a comment

Can you add a regression test for this?

Comment thread vllm/multimodal/video.py Outdated
Signed-off-by: Ranran <hzz5361@psu.edu>
Signed-off-by: Ranran <hzz5361@psu.edu>
@WindChimeRan
Contributor Author

@Isotr0py added regression test

@WindChimeRan WindChimeRan requested a review from Isotr0py May 14, 2026 04:40
Comment thread tests/multimodal/test_video.py Outdated
Signed-off-by: Ranran <hzz5361@psu.edu>
@WindChimeRan WindChimeRan requested a review from Isotr0py May 14, 2026 05:00
@Isotr0py Isotr0py enabled auto-merge (squash) May 14, 2026 05:44
@github-actions github-actions Bot added the ready label May 14, 2026
@Isotr0py Isotr0py merged commit f3d5360 into vllm-project:main May 14, 2026
54 of 55 checks passed
omerpaz95 pushed a commit to omerpaz95/vllm that referenced this pull request May 18, 2026
omerpaz95 pushed a commit to omerpaz95/vllm that referenced this pull request May 18, 2026
mfylcek pushed a commit to mfylcek/vllm that referenced this pull request May 19, 2026
jhu960213 pushed a commit to jhu960213/vllm that referenced this pull request May 20, 2026
h1t35h pushed a commit to h1t35h/vllm that referenced this pull request May 21, 2026
Liuweixiong0118 pushed a commit to Liuweixiong0118/vllm that referenced this pull request Jun 1, 2026
…targets (vllm-project#42586)

Signed-off-by: Ranran <hzz5361@psu.edu>
Signed-off-by: Liuweixiong0118 <lwx34158427@gmail.com>
